ABSTRACT



A method of generating a 2-D extended image from a video
sequence representing a natural 3-D scene first determines
motion parameters for a camera that recorded the scene with
respect to a background object from the video sequence
using a structure-from-motion algorithm. The motion
parameters include a rotation matrix, a translation vector and
a depth map representing the depth of each point in the
background object from the camera. Next, from the motion
parameters and depth map, the 2-D extended image is
generated for the background object as a composition of the
images from the video sequence using a plane perspective
projection technique. The background object may be layered
as a function of depth and flatness criteria to form a set of
layered 2-D extended images for the background object
from the video sequence.

17 Claims, 7 Drawing Sheets 
2-D EXTENDED IMAGE GENERATION FROM 3-D DATA EXTRACTED FROM A
VIDEO SEQUENCE

CROSS-REFERENCE TO RELATED APPLICATIONS

Not applicable

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not applicable

BACKGROUND OF THE INVENTION

The present invention relates to the processing of video
sequences, and more particularly to 2-D extended image
generation from 3-D data extracted from a video sequence of
a natural three-dimensional scene as observed through a
monocular moving camera.

Presently video sequences provide two-dimensional
images of natural three-dimensional scenes as observed
through a monocular moving camera. Normally in order to
produce three-dimensional objects or images a graphics
generator is required for building such objects or images,
which objects or images are then projected onto a
two-dimensional plane for viewing as part of the video sequence.
Three-dimensional manipulation of the three-dimensional
objects or images is performed by the graphics generator, but
the results are seen as a two-dimensional object or image in
the video sequence.

What is desired is a method of generating 2-D extended
images from 3-D data extracted from a two-dimensional
video sequence.

BRIEF SUMMARY OF THE INVENTION

Accordingly the present invention provides a method of
generating 2-D extended images from 3-D data extracted
from a video sequence representing a natural scene. In an
image pre-processing stage image feature points are
determined and subsequently tracked from frame to frame of the
video sequence. In a structure-from-motion stage the image
feature points are used to estimate three-dimensional object
velocity and depth. Following these stages depth and motion
information are post-processed to generate a dense
three-dimensional depth map. World surfaces, corresponding to
extended surfaces, are composed by integrating the
three-dimensional depth map information.

The objects, advantages and other novel features of the
present invention are apparent from the following detailed
description when read in conjunction with the appended
claims and attached drawing.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING

FIG. 1 is a flow diagram view for the generation of 2-D
extended images for an object from a video sequence
representing a natural 3-D scene according to the present
invention.

FIG. 2 is a flow diagram view of a camera motion
parameter estimator according to the present invention.

FIGS. 3A, 3B and 3C respectively are illustrations of an
image in a scene shot from the video image sequence, a
mask for the image defining the object, and a 2-D extended
image for the defined object according to the present
invention.

FIG. 4 is a flow diagram view of a hierarchical matching
scheme according to the present invention.

FIG. 5 is a general flow diagram view of a
structure-from-motion (SFM) algorithm according to the present
invention.

FIG. 6 is a flow diagram view of a pre-processing stage for
the SFM algorithm of FIG. 5 according to the present
invention.

FIGS. 7A, 7B and 7C respectively are illustrations of an
image in a scene from a video sequence, a depth map for the
image, and a segmentation mask for a foreground object in
the image according to the present invention.

FIG. 8 is a flow diagram view of a 2-D extended image
generator according to the present invention.

FIG. 9 is an illustration of the pasting of points of an
image in a scene from a video sequence into a 2-D extended
image according to the present invention.

FIG. 10 is a flow diagram view of a filter for producing
[illegible] pyramids according to the present invention.

FIG. 11 is an illustrative view of the mapping of points
[illegible] sequence into the 2-D extended image according
to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention describes the generation of a 2-D
extended image, referred to as a world image, sprite or wall
paper, from a video of a natural 3-D scene as observed
through a monocular moving camera. This 2-D extended
image represents in some way, usually with some loss, the
information contained in the video representing a natural
3-D scene. [illegible] synthesized from this 2-D extended
image. The method described below is divided into two parts:
3-D camera parameter estimation and 2-D extended image
generation.

Referring now to FIG. 1 a temporal sequence of raw video
images representing a natural 3-D scene projected onto the
[illegible] system. A scene cut detector identifies the
temporal boundaries within the scene that [illegible]
sequence, and the sequence of [illegible] temporal boundaries
constitute [illegible]. The images for the detected scene
shot are input to a pre-segmentor 14 where objects are
identified [illegible] characteristics, such as color, shape
and velocity, by using manual, semi-automatic or automatic
[illegible]. The pre-segmentation process also selects and
tracks features on the objects from image to image within the
scene shot. For example in a video clip of a tennis match, a
tennis player may be identified as an object and the player's
[illegible] or tennis racquet may be selected as features. The
information from the pre-segmentor 14 is input to a 3-D
camera parameter estimator 16 to estimate camera
parameters, namely (i) the 3-D velocity of the objects in the
scene and of the camera relative to some origin in 3-D space,
and (ii) the depth of the objects from the camera. This depth
and motion information is processed to generate a dense 3-D
depth map for each object, as is described below. A decision
block 18 determines the "flatness" of the object, and also
provides feedback to the pre-segmentor 14 to assist in
separating objects on the basis of depth layers. The depth
map information and the scene shot images are input to a
2-D extended image generator 20 to produce a 2-D extended
image for each object in the scene shot.

The 3-D camera parameter estimator 16 uses a
structure-from-motion (SFM) algorithm to estimate camera
parameters from the pre-segmented video, as described in greater
detail below. Currently SFM is typically used to estimate the
3-D velocities and depths of points of rigid opaque objects.
In this application SFM is used for two purposes: first to
estimate the 3-D parameters of independently moving
objects, called foreground objects, such as a car moving
along a street in a video clip; and second to estimate the 3-D
parameters of the scene shot's background, called background
object, such as the audience in the tennis match
video clip. The background object represents that part of the
scene shot that is static in the world, i.e., has a zero velocity
with respect to some point in 3-D space in a world
coordinate system, while the camera and foreground objects may
be in motion with respect to the same point. However in the
video clip the background object may appear to be in motion
because of the camera movement. The video clip therefore
records the velocity of the background object with respect to
the set of coordinate axes located at the focal plane of the
camera. This apparent 3-D velocity of the background object
is opposite in sign but identical in magnitude to the velocity
of the camera with respect to the fixed point in 3-D space.
Thus by estimating the apparent background velocity from
the input video, the 3-D trajectory and orientation of the
camera are determined with respect to the world coordinate
system.

SFM currently requires that foreground objects be
pre-segmented from the background object, and that feature
points on the different objects be chosen manually. In this
application feature points on the objects are automatically
selected and tracked while these objects are still assumed to
be pre-segmented. SFM, when applied to the background
object, provides the 3-D camera velocity parameters and the
depth from the camera of feature points in the background
object. Using the camera velocity parameters, the depth is
estimated at the feature points and the resulting values are
interpolated/extrapolated to generate a depth value for each
image pixel in the scene shot to provide a dense depth map.

The 2-D extended image generator 20 combines the
results of the depth map compositing, using a planar
perspective projection (PPP) model to generate the image. The
PPP model has nine parameters, which enables the
estimation or prediction of the 2-D extended image given any
single image, i.e., current frame or snapshot of the scene
shot. The PPP model parameters are also used to estimate or
predict a current frame from the extended image. The depth
map, which is generated for pairs of successive images, is
combined with the PPP model parameters associated with
the first image in the pair to generate approximate PPP
model parameters associated with the second image. These
approximate PPP model parameters are refined using a
dyadic pyramid of the current image and the second image
in the pair. The current image represents the composition of
the images in the scene shot up to and potentially including
the first image in the pair. The final PPP parameters along
with the second image are used to update the current
extended image to include the second image.

As indicated above, structure-from-motion is a method by
which a 3-D rigid object velocity (motion) and depth
(structure) may be estimated using different 3-D views of the
same scene. The present SFM implementation provides a
robust estimation of the 3-D camera motion parameters and
the depth at selected points in the 3-D scene shot from a
mono-scopic video sequence, i.e., video recorded by a
monocular camera. Independently moving objects, i.e.,
foreground objects, in the scene shot are pre-segmented as
described above.

The 3-D camera velocity is estimated from a set of image
feature points belonging to static parts, i.e., background
objects, that are extracted from and tracked through the
scene shot being analyzed. For a static 3-D scene as recorded
by a moving camera, the effective motion of the feature
points projected onto the camera's focal plane is related to
the camera's 3-D motion. In fact a camera moving along a
3-D trajectory induces an "equal and opposite" motion on its
focal plane for any feature point. In the absence of camera
motion [illegible]. Non-zero [illegible] SFM to estimate the
camera's 3-D motion parameters, also known as
egomotion and trajectory. The camera velocity is
described by the parameters of rotation and translation in
space. Once these parameters are known, the scene
depth at the feature points may be estimated, since scene
depth is merely the distance between the feature point and
the camera.

The 3-D camera parameter estimator 16 has the following
submodules, as shown in FIG. 2: (i) feature point extractor
24, (ii) feature point tracker 26, (iii) structure-from-motion
28, and (iv) depth map generator 30. The inputs to the 3-D
camera parameter estimator 16 are raw video images,
denoted by I_k, and the corresponding "alpha" images,
denoted by A_k. The alpha image is a binary mask that
determines the "valid" regions inside each image, i.e., the
regions of interest or objects, as shown in FIG. 3 where FIG.
3A represents an image I_k from a tennis match and FIG. 3B
represents the alpha image A_k for the background object
with the tennis player blanked out. The alpha images are
obtained from the pre-segmentor 14, either through a rough
pre-segmentation or through user interaction.

Feature points are the salient features of an image such as
corners, edges, lines or the like, where there is a high amount
of image intensity contrast. The feature points are selected
because they are easily identifiable and may be tracked
robustly. The feature point extractor 24 may use a
Kitchen-Rosenfeld corner detection operator C, as is well known in
the art. This operator is used to evaluate the degree of
"cornerness" of the image at a given pixel location.
"Corners" are generally image features characterized by the
intersection of two directions of image intensity gradient
maxima, for example at a 90 degree angle. To extract feature
points the Kitchen-Rosenfeld operator is applied at each valid
pixel position [illegible] neighborhood. The neighborhood may
be a 5x5 matrix centered on the pixel position (x,y). To
assure robustness the selected feature points have a
cornerness greater than a threshold, such as T = 10. The
output from the feature point extractor 24 is a set of
feature points {F_k} in image I_k, where each F_k corresponds
to a "feature" pixel position in I_k.

Given a set of feature points F_k in image I_k the feature
point tracker 26 tracks the feature points into the next image
of the scene shot, I_{k+1}, by finding their closest match.
"Closest match" is defined by the pixel position in I_{k+1}
that maximizes a cross-correlation measure CC defined as:

\[
CC = \frac{W \sum_{m,n} I_k(x+m,y+n)\, I_{k+1}(x'+m,y'+n)
      - \Big(\sum_{m,n} I_k(x+m,y+n)\Big)\Big(\sum_{m,n} I_{k+1}(x'+m,y'+n)\Big)}
     {\sqrt{W \sum_{m,n} I_k^2(x+m,y+n) - \Big(\sum_{m,n} I_k(x+m,y+n)\Big)^2}\;
      \sqrt{W \sum_{m,n} I_{k+1}^2(x'+m,y'+n) - \Big(\sum_{m,n} I_{k+1}(x'+m,y'+n)\Big)^2}}
\]

with the sums taken over the neighborhood offsets (m,n),
evaluated by overlapping a D_v x D_h intensity neighborhood
around the feature point at (x,y) in image I_k over an equal
sized neighborhood around a candidate target pixel (x',y') in
I_{k+1}, for W = (2D_v+1)(2D_h+1). CC has a value of one if the
two neighborhoods being compared are identical, and minus
one if they have exactly opposite intensity profiles. The
feature point tracker 26 gives a "closest match" point in I_{k+1}
corresponding to each feature point in I_k. Each pair formed
by a feature point in I_k and its closest match in I_{k+1} is referred
to as a "feature point correspondence" pair.
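
A minimal sketch of the cross-correlation measure CC may help make the matching criterion concrete. It assumes the standard normalized cross-correlation form, which is consistent with the stated properties (one for identical neighborhoods, minus one for exactly opposite profiles). The function name `cross_correlation`, the nested-list patch layout, and the choice to return 0.0 for a constant patch are illustrative assumptions, not part of the original text.

```python
import math

def cross_correlation(patch_a, patch_b):
    """Normalized cross-correlation of two equal-sized intensity patches.

    Each patch is a list of rows of numbers (hypothetical layout).
    """
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    if not a or len(a) != len(b):
        raise ValueError("patches must be non-empty and equal in size")
    w = len(a)  # number of pixels in the neighborhood (W in the text)
    sum_a, sum_b = sum(a), sum(b)
    numerator = w * sum(p * q for p, q in zip(a, b)) - sum_a * sum_b
    var_a = w * sum(p * p for p in a) - sum_a ** 2
    var_b = w * sum(q * q for q in b) - sum_b ** 2
    if var_a <= 0 or var_b <= 0:
        # Constant patch: correlation is undefined. Returning 0.0 (no match)
        # is an assumption made for this sketch.
        return 0.0
    return numerator / math.sqrt(var_a * var_b)
```

In a tracker, this function would be evaluated for every candidate pixel in the search neighborhood, and the candidate with the largest value would be kept as the closest match.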

In order to improve robustness and increase processing
speed in the feature point tracker 26, four additional
processes may be used:

(i) Outlier removal due to low cross-correlation: every
feature point correspondence pair must have an
associated cross-correlation that is greater than a
cross-correlation threshold. Feature points whose closest
matches do not satisfy this threshold are eliminated.
Further, feature point correspondence pairs are sorted in
descending order according to their cross-correlation
values, the higher cross-correlations being the most
reliable ones, and vice versa. A percentage T_p of feature
point correspondence pairs with the lowest
cross-correlations is eliminated from the correspondence
list. Representative values for the cross-correlation
threshold and T_p are 0.9 and 40 respectively.

(ii) Outlier removal due to lack of strength: a feature
correspondence pair (x,y) <-> (x',y') in I_k and I_{k+1}
respectively is strong if a bidirectional matching
criterion is satisfied by the feature points involved, i.e.,
the closest match of a point (x,y) in I_k must be the point
(x',y') in I_{k+1}, and conversely the closest match of the
point (x',y') in I_{k+1} must be the point (x,y) in I_k.
Otherwise the correspondence is weak and is eliminated
from the correspondence list.

(iii) Hierarchical matching: in order to increase speed of
establishing feature point correspondences, a
hierarchical approach is used in which the closest matches
are searched for using a Gaussian pyramid of the images
involved, as explained further below.

(iv) Subpixel resolution match refinement: in order to
increase accuracy of matching after the closest matches
have been determined at pixel resolution, a local
refinement of the matches is done at subpixel resolution,
such as by adapting the method described in U.S. Pat. No.
5,623,313.
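
The first of these processes, outlier removal due to low cross-correlation, can be sketched as follows. The function name `prune_by_correlation`, the tuple layout of a correspondence pair, and the rounding down of the number of dropped pairs are assumptions made for illustration; the defaults 0.9 and 40 are the representative values given above.

```python
def prune_by_correlation(pairs, cc_threshold=0.9, t_p=40):
    """Drop unreliable feature point correspondence pairs.

    pairs: list of (point_in_k, point_in_k1, cc) tuples (hypothetical layout).
    First every pair whose cross-correlation is not greater than
    cc_threshold is removed, then the t_p percent of the survivors with
    the lowest cross-correlation are removed as well.
    """
    kept = [p for p in pairs if p[2] > cc_threshold]
    kept.sort(key=lambda p: p[2], reverse=True)  # most reliable first
    n_drop = (len(kept) * t_p) // 100  # rounding down is an assumption
    return kept[:len(kept) - n_drop]
```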

The process of finding closest matches using hierarchical
matching is shown in FIG. 4. Depending upon the expected
maximum image velocity, the number of pyramid levels
required, L_max, is estimated [equation missing in source]
from the maximum expected image velocity, which
may be provided by the user in lieu of a default, and S, a
specified search range per pyramid level in pixels per frame,
such as a +/-2 range about the target pixel. The inputs to the
hierarchical matching are the feature point set F_k in image I_k
and the L_max-level Gaussian pyramid of images I_k and I_{k+1}.
For the initial conditions F_{k+1} = F_k and L = L_max. As shown in
FIG. 4 the steps include:

1. Decimate the points in F_k to level L, i.e., input feature
point F_k^i = (x_i, y_i) becomes round(x_i/2^L, y_i/2^L); if
L = L_max, decimate the points in F_{k+1} to level L.

2. Due to this decimation/rounding of point positions, F_k
points that are close together at the bottom of the
pyramid may merge into the same point higher up in the
pyramid. To save time and without loss of accuracy
these "redundant" points are temporarily removed in
the higher levels of the pyramid. An inheritance list,
describing the mapping of all points F_k^i into unique
(non-redundant) points, is stored.

3. One-way match: closest matches are found for the
unique decimated feature points in F_k. The
corresponding points F_{k+1}^i = (x'_i, y'_i) are used as "initial
guesses" for the position of the closest matches, i.e., the
cross-correlation operator CC is applied between each
point F_k^i and each point (x',y') in a forward search
neighborhood around the point F_{k+1}^i. The position (x',y')
that maximizes CC is stored as F_{k+1}^i. In this step the
images used are those of the L-th level of the Gaussian
pyramid.

4. Using the inheritance list redundant points are restored,
assigning to them their corresponding closest matches.

5. Bring the points in F_{k+1} one level down the pyramid so
that feature point F_{k+1}^i = (x'_i, y'_i) becomes (2x'_i, 2y'_i).

6. Set L = L-1.

7. Repeat the above steps until and including when
L = 0. When L = 0 the data set F_{k+1} represents the closest
matches of F_k in I_{k+1}, Step 5 is not executed, and the
outlier removal due to low cross-correlation is
enforced.
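
The bookkeeping in Steps 1, 2, 4 and 5 can be sketched as below. This is only an illustration under stated assumptions: the function names (`decimate_points`, `restore_matches`, `promote_one_level`) and the index-based form of the inheritance list are hypothetical, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for the unspecified rounding rule.

```python
def decimate_points(points, level):
    """Steps 1-2: map full-resolution points to pyramid `level`.

    Returns (unique_points, inheritance), where inheritance[i] is the index
    in unique_points of the decimated version of points[i].
    """
    scale = 2 ** level
    unique, index_of, inheritance = [], {}, []
    for x, y in points:
        decimated = (round(x / scale), round(y / scale))
        if decimated not in index_of:  # first time this position is seen
            index_of[decimated] = len(unique)
            unique.append(decimated)
        inheritance.append(index_of[decimated])
    return unique, inheritance

def restore_matches(unique_matches, inheritance):
    """Step 4: give every original point the match found for its unique point."""
    return [unique_matches[i] for i in inheritance]

def promote_one_level(points):
    """Step 5: bring points one level down the pyramid (coordinates double)."""
    return [(2 * x, 2 * y) for x, y in points]
```

The one-way match of Step 3 would run between `decimate_points` and `restore_matches`, operating only on the unique points.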

After the above "forward matching" is completed, the
weak matches are removed by "backward matching", i.e.,
for each point F_{k+1}^i the cross-correlation operator is applied
in all points (x,y) in a backward search neighborhood around
F_k^i. If the point that maximizes CC is different from F_k^i, then
this point is a weak match and is rejected. A less strict
constraint is to accept such (x,y) if it falls within a radius
r_strong around F_k^i. If the backward search neighborhood is
large, then the backward matching is done hierarchically to
save time. The resulting output from the feature point tracker
26 is a set of strong, robust feature point correspondence
pairs contained in point sets F_k and F_{k+1}.
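
The backward-matching test can be sketched as a simple consistency check between the forward and backward results. The function name `strong_matches`, the dictionary representation of the matches, and the default radius of zero (the strict criterion) are assumptions for this sketch.

```python
import math

def strong_matches(forward, backward, r_strong=0.0):
    """Keep only correspondences confirmed by backward matching.

    forward:  dict mapping a point in I_k to its closest match in I_{k+1}.
    backward: dict mapping a point in I_{k+1} to its closest match in I_k.
    A pair survives if the backward match lands within r_strong of the
    original point; r_strong = 0 is the strict bidirectional criterion.
    """
    strong = {}
    for point, match in forward.items():
        back = backward.get(match)
        if back is None:
            continue  # no backward result available for this match
        if math.hypot(back[0] - point[0], back[1] - point[1]) <= r_strong:
            strong[point] = match
    return strong
```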

The SFM submodule 28, as shown in FIG. 5, provides 
estimates for a camera rotation matrix R and a camera 
translation vector T, which are done following an "essential 
matrix" approach to SFM. Since this approach is sensitive to 
input noise, two different approaches may be used that result 
in accurate 3-D camera parameters. Both of these 
approaches require a "preprocessing stage" in which differ- 
ent sets of camera parameters are estimated. Then the best 
possible answer is chosen from the approach followed. 

The preprocessing stage 29 has as inputs the feature point
correspondence pairs contained in F_k and F_{k+1}, the images I_k
and I_{k+1}, and measured feature point velocity vectors,
Vin_k^i = (vx_k^i, vy_k^i) = F_{k+1}^i - F_k^i = (x'_i, y'_i) - (x_i, y_i).
The initial conditions are: divide I_k into approximately
identical image blocks B_k^j, i.e., rectangular image regions,
for j = {1, . . . , N_points} so that each block "contains"
approximately the same number of feature points F_k^i
associated with it, and each point F_k^i belongs to one and
only one block. Each block exclusively represents one region
of I_k. This enforces that the feature points used in the
estimation of the camera parameters span the whole extension
of the input image, resulting in a more general, robust set of
parameters. The preprocessing stage 29 is shown in greater
detail in FIG. 6, as described below.
Step 1: From each block randomly draw one feature point
correspondence pair according to a uniformly-distributed
random number generator so that all feature point
correspondence pairs within each block are equally likely
to be chosen. The result is N_points feature correspondence
pairs that span the whole of I_k.
Step 2: Calibrate the feature point coordinates using a
coordinate calibration transformation so that (i) the
spatial center of mass of the feature points is
approximately at the (0,0) coordinate origin and (ii) the
feature points fall approximately within a radius of SQRT(2)
of the center of mass. This results in a more stable
estimation of an essential (E) matrix. The fastest, easiest
transformation is:

\[
\begin{bmatrix} \hat{x}_i \\ \hat{y}_i \\ 1 \end{bmatrix}
=
\begin{bmatrix}
2/N_{cols} & 0 & -1 \\
0 & 2/N_{rows} & -1 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}
\]

where (x_i, y_i) becomes the "normalized" (\hat{x}_i, \hat{y}_i)
image position, and N_rows and N_cols are the number of rows
and columns of I_k respectively.
Step 3: Compute the essential matrix E using the above
normalized N_points point correspondence pairs.
Step 4: Extract the camera rotation matrix R and the
camera translation vector T from the computed essential
matrix E.
Step 5: Given R and T estimate the depth Z_i at every
feature point F_k^i.
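
Steps 1 and 2 above can be sketched as follows. The block partition is taken as given (how blocks are formed is not specified here), and the normalization assumes a mapping of pixel coordinates onto the square [-1, 1] x [-1, 1], which satisfies the stated center-of-mass and SQRT(2) radius conditions; both function names are hypothetical, and the exact transformation is an assumption rather than a confirmed detail of the method.

```python
import random

def draw_one_per_block(blocks, rng):
    """Step 1: draw one correspondence pair uniformly at random per block.

    blocks: list of lists of correspondence pairs; empty blocks are skipped.
    rng: a random.Random instance, so that draws are reproducible.
    """
    return [rng.choice(block) for block in blocks if block]

def normalize_point(x, y, n_cols, n_rows):
    """Step 2 (assumed form): map pixel coordinates to roughly [-1, 1]."""
    return (2.0 * x / n_cols - 1.0, 2.0 * y / n_rows - 1.0)
```

With this mapping the image center goes to the origin and the image corners sit at distance SQRT(2) from it.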

The above steps should be repeated enough times to
exhaust all possible combinations of feature points. However
such an approach may be intractable from a computational
point of view: the total number of operations may be
too large to be feasible for realtime implementations. Instead
a robust suboptimum statistical approach determines the
choice of the "best" set of camera motion parameters from
a subset of the possible combinations. Therefore the above
steps are repeated N_comb times, resulting in approximately
N_comb different estimated sets of 3-D camera motion
parameters.

The robustness and speed of the preprocessing may be
enhanced by the following two steps:

(i) After Step 3 and before Step 4 above, compute depth
values Z_i for each feature point in the set of N_points. If
at least one out of the N_points depth values has a
different sign, i.e., all N_points depths but at least one
have a positive sign, there exists at least one outlier in
the set of randomly chosen N_points points and, therefore,
it is very likely that the resulting E matrix will be
erroneous, hence making the remaining Steps fruitless.
In such a case, especially when the level of outliers in
the input data sets is high, throwing away this
combination of points both reduces overall processing time,
for time is spent only on meaningful combinations, and
increases the confidence in the final results.

(ii) If Vin_k^i is zero for all N_points randomly chosen points,
then according to this combination of points the camera
is static, i.e., to save time instead of computing the E
matrix and extracting camera parameters the result of
this combination is automatically set to a motionless
camera: R = diag(1,1,1) and T = [0,0,0]^T.
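
Both shortcuts reduce to inexpensive tests on the randomly drawn combination. The sketch below assumes the depths and measured velocities are available as plain Python sequences; the function names and the tolerance parameter `eps` are illustrative assumptions.

```python
def has_depth_outlier(depths):
    """Shortcut (i): True if the depths do not all share the same sign."""
    positives = sum(1 for z in depths if z > 0)
    return 0 < positives < len(depths)

def camera_is_static(velocities, eps=0.0):
    """Shortcut (ii): True if every measured image velocity is zero.

    eps allows a small tolerance; eps = 0 is the literal condition.
    """
    return all(abs(vx) <= eps and abs(vy) <= eps for vx, vy in velocities)

# Result assigned to a combination for which the camera is static.
MOTIONLESS = ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0, 0, 0])  # (R, T)
```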

After the preprocessing stage is completed, the N_comb sets
of R, T and Z_i represent candidate solutions for camera
parameters. Then the best set is determined by one of the
following two methods 31A or 31B.

Method 1

Given N_comb sets of 3-D parameters, compute the
resulting output image velocity vectors Vout_k^i at every
feature point F_k^i. Then compute the "overall velocity
estimation error": [equation missing in source]

The combination that minimizes the overall velocity
estimation error is chosen as the best set of camera parameters.
This process provides accurate results even in the presence
of outliers in the feature point correspondence pairs that may
have survived the outlier removal procedures. An "early
exit" condition may be used to detect when a satisfactory
estimate of the motion parameters is found and stop
searching. The condition is that for every combination the
percentage of points F_k^i is computed that satisfy the inequality
||Vout_k^i - Vin_k^i|| <= T_v, where T_v is a specified maximum
accepted velocity error in pixels per frame. If such
percentage is higher than a minimum required performance
percentage T_p then no further search is required as this
combination satisfies the minimum requirement. The camera
motion parameters produced by this combination of feature
points are accurate enough as long as the thresholds are
strict enough, such as T_v = 0.25 and T_p = 90.
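
The selection logic of Method 1, including the early exit, can be sketched as below. Because the overall velocity estimation error is not given explicitly in this copy, the sketch assumes it is the sum of the Euclidean norms of the per-point velocity differences; that choice, the function name `select_best_combination`, and the data layout are all assumptions.

```python
import math

def select_best_combination(candidates, v_in, t_v=0.25, t_p=90.0):
    """Return the index of the chosen candidate set of camera parameters.

    candidates: one list of predicted velocities (vx, vy) per combination.
    v_in:       measured velocities (vx, vy), one per feature point.
    """
    best_index, best_error = None, float("inf")
    for index, v_out in enumerate(candidates):
        errors = [math.hypot(ox - ix, oy - iy)
                  for (ox, oy), (ix, iy) in zip(v_out, v_in)]
        within = 100.0 * sum(e <= t_v for e in errors) / len(errors)
        if within > t_p:
            return index  # early exit: this combination is good enough
        total = sum(errors)  # assumed form of the overall error
        if total < best_error:
            best_index, best_error = index, total
    return best_index
```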

Method 2

Given N_comb sets of 3-D parameters the best camera
parameters are found in a statistical sense. The best
translation vector is obtained as [equation missing in source]
where t_x^m, t_y^m and t_z^m are the components of the T vector
resulting from the m-th random combination of feature points.
The resulting T vector is then normalized so that ||T|| = 1 (if
t_x = t_y = t_z = 0 then T is left as a zero vector). This
normalization of T does not affect the results because T is
only estimated up to a scaling factor.

To estimate the best rotation matrix R the statistics of the
angles of rotation, namely [\alpha \beta \gamma \theta], are used.
\alpha, \beta and \gamma describe the axis of rotation and are
subject to the constraint that

\[ \cos^2(\alpha) + \cos^2(\beta) + \cos^2(\gamma) = 1 \]

and \theta specifies the angle of rotation about such axis.
To compute [\alpha \beta \gamma]:

(i) Choose \alpha_1, the angle \alpha, \beta or \gamma with the
smallest variance. The best value of such angle is its own
statistical mean.

(ii) Choose \alpha_2, the angle \alpha, \beta or \gamma with the
second smallest variance, the best value of such angle being
its own mean constrained

\[ \alpha_2 = \frac{1}{\mathrm{card}(Y)} \sum_{m \in Y} \alpha_2^m \]

where Y = {m | m \in {1, . . . , N_comb} and
\cos^2(\alpha_1) + \cos^2(\alpha_2^m) <= 1}, card(Y) being the
number of elements in Y.

(iii) Choose \alpha_3, the angle with the greatest variance, the
best value being
\alpha_3 = arccos SQRT(1 - \cos^2(\alpha_1) - \cos^2(\alpha_2)).

Finally the angle of rotation is determined by its own
mean, namely

\[ \theta = \frac{1}{N_{comb}} \sum_{m=1}^{N_{comb}} \theta^m \]

After all four best angles are computed the rotation matrix
R is formed which corresponds to such angles:

\[
R = \begin{bmatrix}
n_1^2 + (1-n_1^2)\cos\theta & n_1 n_2 (1-\cos\theta) - n_3 \sin\theta & n_1 n_3 (1-\cos\theta) + n_2 \sin\theta \\
n_1 n_2 (1-\cos\theta) + n_3 \sin\theta & n_2^2 + (1-n_2^2)\cos\theta & n_2 n_3 (1-\cos\theta) - n_1 \sin\theta \\
n_1 n_3 (1-\cos\theta) - n_2 \sin\theta & n_2 n_3 (1-\cos\theta) + n_1 \sin\theta & n_3^2 + (1-n_3^2)\cos\theta
\end{bmatrix}
\]

where n_1 = \cos\alpha, n_2 = \cos\beta and n_3 = \cos\gamma.
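
As an illustration, the rotation matrix can be built from the three direction angles and the rotation angle with the standard axis-angle formula. The function name `rotation_matrix` and the nested-list return type are assumptions for this sketch.

```python
import math

def rotation_matrix(alpha, beta, gamma, theta):
    """Rotation by theta about the axis with direction cosines
    (cos alpha, cos beta, cos gamma). Angles are in radians."""
    n1, n2, n3 = math.cos(alpha), math.cos(beta), math.cos(gamma)
    c, s = math.cos(theta), math.sin(theta)
    k = 1.0 - c
    return [
        [n1 * n1 + (1 - n1 * n1) * c, n1 * n2 * k - n3 * s, n1 * n3 * k + n2 * s],
        [n1 * n2 * k + n3 * s, n2 * n2 + (1 - n2 * n2) * c, n2 * n3 * k - n1 * s],
        [n1 * n3 * k - n2 * s, n2 * n3 * k + n1 * s, n3 * n3 + (1 - n3 * n3) * c],
    ]
```

For example, a quarter turn about the z axis (alpha = beta = pi/2, gamma = 0, theta = pi/2) maps the x axis onto the y axis.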

Either of the above methods produces a best estimate for
the 3-D camera motion parameters. It is possible that a small
subset of feature points have corresponding negative depths.
Since the above estimated camera parameters give an
accurate result for the large majority of feature points, the small
subset of points with negative depths is tagged as outliers
and discarded. The output of the SFM submodule 28 is the
best set of 3-D camera parameters and the depth at the
location of the image feature points in F_k.

Continuous depth for all points in an image is obtained
by interpolation from the depths at the feature points. From
the sets of feature points present in the image pair and an
estimate of the depth at each feature point, and assuming that
the feature points are chosen so that they lie relatively close
to each other and span the whole image, the depth map
generator 30 creates a 2-D mesh structure by interconnecting
such feature points in which the feature points lie at the
vertices of formed polygons. The closer the feature points
are to each other, the denser the resulting 2-D mesh
structure. Since the depth at each vertex of the 2-D structure is
known, the depths at the points within each polygon may be
estimated. In this way the depth at all image pixel positions
may be estimated. This may be done by planar interpolation,
as described below.

A robust and fast method of generating the 2-D mesh
structure is Delaunay triangulation. The feature points are
connected to form a set of triangles whose vertices lie at
feature point positions. Using the depth associated with each
feature point and its corresponding vertex, a "depth plane"
may be fitted to each individual triangle from which the
depths of every point within the triangle may be determined.
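
Fitting a "depth plane" to one triangle is equivalent to barycentric interpolation of the three vertex depths. The sketch below covers only that step and assumes the triangulation itself has already been computed elsewhere; the function name `plane_depth` is hypothetical.

```python
def plane_depth(triangle, depths, point):
    """Depth at `point` on the plane through the three triangle vertices.

    triangle: ((x1, y1), (x2, y2), (x3, y3)) vertex positions.
    depths:   (z1, z2, z3) depths at those vertices.
    """
    (x1, y1), (x2, y2), (x3, y3) = triangle
    z1, z2, z3 = depths
    x, y = point
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle")
    # Barycentric weights of the point with respect to the three vertices.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    return w1 * z1 + w2 * z2 + w3 * z3
```

Applying this to every pixel inside every triangle of the mesh yields the interior of the dense depth map.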

As a rule feature points at or very close to the boundaries
of the image are not chosen to avoid having problems with
artifacts in image boundaries and to avoid wasting time
tracking points which may go out of the frame in subsequent
images of the sequence. Therefore depths at or very close to
the image boundaries cannot be obtained using the above
procedure. Depth at or near image boundaries may be
obtained by extrapolating from the known depths of the
closest points near the target boundary point. A measure of
the weighted average of the depths at neighboring points is
used. In this measure the weights are inversely proportional
to the distance from the target boundary point and the
neighboring points with known depths. In this way the
points closest to the target image boundary point have
greater effect than more distant points. From the depth map
generator 30 an image Z_k is obtained that contains the
estimated depths for every pixel position in image I_k. A
typical result is shown in FIG. 7 where FIG. 7A represents
the image I_k and FIG. 7B represents the image Z_k.
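
The boundary extrapolation amounts to an inverse-distance weighted average. A minimal sketch follows; the function name `extrapolate_depth`, the data layout, and the handling of a neighbor that coincides with the target are assumptions.

```python
import math

def extrapolate_depth(target, neighbors):
    """Inverse-distance weighted average of known depths.

    target:    (x, y) position of the boundary point.
    neighbors: list of ((x, y), depth) pairs with known depths.
    """
    weighted_sum = weight_total = 0.0
    for (x, y), depth in neighbors:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0:
            return depth  # depth already known at the target itself
        weighted_sum += depth / distance
        weight_total += 1.0 / distance
    if weight_total == 0:
        raise ValueError("at least one neighbor is required")
    return weighted_sum / weight_total
```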

The density of the map is the user's choice. If the desired
output is a dense depth map, then the user may so specify
and a high number of feature points is selected. Otherwise
default values are applied and the process runs faster with
fewer feature points. This is the end of the SFM algorithm
16. From the input images I_k and I_{k+1} the final products are
the sets of feature points F_k and F_{k+1}, the rotation matrix R,
the translation vector T and the depth map Z_k.

The decision block 18 has the role of determining, given
the segmentation, 3-D camera parameters and scene depth
information, for which objects the system generates 2-D
extended images. The decision to generate 2-D extended
images is based on the fact that the depth of each foreground
or background object exhibits a small spread relative to its
average value. If Z^\alpha is the average depth value for the
\alpha-th object, and if \Delta Z^\alpha is the largest difference
in depth between pairs of object points, then the condition
for a small depth spread is |\Delta Z^\alpha| / Z^\alpha << 1.
This means that the \alpha-th object is considered as "flat"
relative to a given collection of camera view points for which
the depth values were computed. In general the flatness
condition occurs when the objects are far from the camera,
such that Z^\alpha is large, or when the objects are flat
themselves. In addition to the flatness condition for
each individual object, a further decision is made on how far
each object is relative to the other objects and how these
objects are distributed in 3-D space according to their depth.
The relative object depth is determined by the ratios of their
average depth values, and their 3-D distribution is based on
the relative ordering of the average depth values, i.e., in
ascending order.

If the objects are considered to be "flat" and they are 
ordered in a sequence of superimposed layers, then the 
decision block 18 determines that 2-D extended images 
should be generated for each layer. 
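
The flatness test and the depth ordering of the layers may be sketched as below. This is a minimal illustration under stated assumptions: the condition "much smaller than 1" is approximated by a hypothetical threshold of 0.1, and the function names and the dictionary layout are not part of the description.

```python
def is_flat(depths, ratio_threshold=0.1):
    """Check |dZ| / mean(Z) << 1, with "<<" replaced by a hypothetical threshold."""
    mean_depth = sum(depths) / len(depths)
    spread = max(depths) - min(depths)  # largest pairwise depth difference
    return spread / mean_depth < ratio_threshold

def layer_order(objects, ratio_threshold=0.1):
    """Return names of flat objects ordered by ascending average depth.

    objects: dict mapping object name -> list of depth samples.
    """
    flat = {
        name: sum(z) / len(z)
        for name, z in objects.items()
        if is_flat(z, ratio_threshold)
    }
    return sorted(flat, key=flat.get)
```

An object with a large depth spread relative to its mean, such as a nearby tree, is excluded, while distant or planar objects are retained and ordered from nearest to farthest.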

The decision block 18 also generates feedback informa- 
tion to the pre-segmentor 14, as shown in FIG. 1. This occurs 
when depth information about a given object is used to 
further segment it, such as separating a foreground tree from 
an image as shown in FIG. 7c. 

The input video may be thought to represent one or more
physically meaningful objects. A 2-D extended image may
be generated for each object, and more commonly for the
background object. The location and shape of the object of
interest is available as an alpha image map, as shown in FIG.
3fr. The alpha map A_k is an image having the same dimensions
as the input images I_k, and may be generated through depth
segmentation performed as part of the decision process or by
other means as described above. For certain kinds of input
video materials, where (a) object segmentation is performed
by other means and (b) motion between consecutive frames
is not significant, the outputs from the SFM algorithm, R, T
and Z, are optional. For this simple subset of video a refine
prediction box 34, as shown in FIG. 8, uses the refined
parameters generated using a previous picture as an initial
approximation (coarse or rough estimates). The amount of
motion that the 2-D extended image generator 20 can handle
without inputs from the SFM algorithm depends on (a) the
number of levels of the pyramid used in refining parameters
and (b) the presence or absence of local minima in the cost
function used in refining parameters. I_k(x,y) represents the
intensity (luminance) at location (x,y), and U_k(x,y) and
V_k(x,y) represent the chrominance values. A value of
A_k(x,y)≥T_α means the pixel at position (x,y) belongs to the object
of interest; otherwise it does not. The rotation, translation and
depth maps are estimated by the SFM algorithm, as
described above, and the median of the estimated depth map
at valid pixel locations for the object is determined.
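
The median depth over the valid pixel locations of an object may be computed as sketched below. The list-of-rows image layout, the function name and the default alpha threshold of 128 are assumptions for illustration only.

```python
from statistics import median

def representative_depth(depth_map, alpha_map, t_alpha=128):
    """Median depth over pixels whose alpha value is at least t_alpha.

    depth_map, alpha_map: lists of rows with identical dimensions.
    """
    valid = [
        z
        for depth_row, alpha_row in zip(depth_map, alpha_map)
        for z, a in zip(depth_row, alpha_row)
        if a >= t_alpha
    ]
    if not valid:
        raise ValueError("object has no valid pixels")
    return median(valid)
```

The median is used rather than the mean so that isolated depth outliers inside the object mask have little influence on the result.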

An integrator 32 is used to generate the 2-D extended
image W_k at time instances 1 through k by combining SFM
information with PPP information. The intensity component
of the 2-D extended image is WI_k, the chrominance
components are WU_k and WV_k, and the alpha component is
WA_k. One goal of 2-D extended image generation is to be
able to predict input images at time instances 1 through k
from W_k. A nine parameter plane perspective projection
(PPP) is used to perform the prediction:

I'_k(x,y) = ρ_k * WI_k(f_k(x,y), g_k(x,y))

where f_k and g_k are defined through:

f_k(x,y) = {P_k(0,0)*x + P_k(0,1)*y + P_k(0,2)} / {P_k(2,0)*x + P_k(2,1)*y + P_k(2,2)}

g_k(x,y) = {P_k(1,0)*x + P_k(1,1)*y + P_k(1,2)} / {P_k(2,0)*x + P_k(2,1)*y + P_k(2,2)}



P_k(2,2) is forced to be 1.
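
The coordinate mapping performed by the plane perspective projection may be sketched as follows. The function name and the list-of-rows matrix layout are assumptions of this sketch; the arithmetic follows the definitions of the two mapping functions given above.

```python
def ppp_map(P, x, y):
    """Map (x, y) through the 3x3 matrix P (list of rows), as f_k and g_k do."""
    den = P[2][0] * x + P[2][1] * y + P[2][2]
    if den == 0:
        raise ZeroDivisionError("point maps to infinity")
    fx = (P[0][0] * x + P[0][1] * y + P[0][2]) / den
    gy = (P[1][0] * x + P[1][1] * y + P[1][2]) / den
    return fx, gy
```

With the identity matrix every point maps to itself; non-zero entries in the third row introduce the perspective division.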

The goal of 2-D extended image generation is to compute
the planar image W_k and the associated P_k and ρ_k for each k. At
termination, where k=K, the final W_K and the history of P_k
and ρ_k may be used to synthesize images I_k, U_k, V_k, 1≤k≤K.
For each k the nine parameters, i.e., the eight unknown
components of the 3x3 matrix P_k and ρ_k, are estimated such that
the synthesized images approximate the corresponding input
images.

The cost function used to achieve this approximation is:

MSE(P_k,ρ_k) = (1/N_v) * Σ {I'_k(x,y) - I_k(x,y)}^2

such that I'_k and A'_k are the images predicted from the
planar image through f_k(x,y) and g_k(x,y), where (x,y) in the
above summation is such that A'_k(x,y)≥T_α and A_k(x,y)≥T_α,
and N_v is the number of valid pixel locations (x,y).

The first 2-D extended image generated is equal to the first
image in the scene shot:

WI_1 = I_1
WU_1 = U_1
WV_1 = V_1
WA_1 = A_1

and the estimated parameters are simply:

P_1 =
1 0 0
0 1 0
0 0 1

Initialize an image "HITS" of dimensions equal to those of
I_k:

HITS(x,y) = 1 if A_1(x,y)≥T_α

For every new picture I_k, k>1, from the input video
sequence SFM is optionally performed between the previous
image I_{k-1} and the present image I_k to obtain T_{k-1}, R_{k-1} and
Z_{k-1} for the previous image. From the previous image depth
map a representative depth for the objects of interest in the
previous picture is obtained as

Z' = median{all Z_{k-1}(x,y), where A_{k-1}(x,y)≥T_α}

Then a 3x3 matrix Q_{k-1} is computed.

Q_{k-1} is normalized so that the element in the last row,
last column of Q_{k-1}, i.e., Q_{k-1}(2,2), equals 1.0:

Q_{k-1} = Q_{k-1} / Q_{k-1}(2,2)

Given a point (x_n,y_n) in normalized coordinates of image I_k,
Q_{k-1} is used to find an approximate location (x'_n,y'_n) of the
corresponding point in image I_{k-1} as:

x'_n = {Q_{k-1}(0,0)*x_n + Q_{k-1}(0,1)*y_n + Q_{k-1}(0,2)} / {Q_{k-1}(2,0)*x_n + Q_{k-1}(2,1)*y_n + Q_{k-1}(2,2)}

y'_n = {Q_{k-1}(1,0)*x_n + Q_{k-1}(1,1)*y_n + Q_{k-1}(1,2)} / {Q_{k-1}(2,0)*x_n + Q_{k-1}(2,1)*y_n + Q_{k-1}(2,2)}

The normalized coordinates (x'_n,y'_n) may be converted to
pixel coordinates (x',y') using the input image dimensions.
The corresponding point (x",y") in W_{k-1} is:

x" = f_{k-1}(x',y')
y" = g_{k-1}(x',y')

Thus for a given point in I_k a corresponding point may be
found in W_{k-1}. This correspondence is used to obtain an
approximate P_k as follows:

1. For the four corner points in I_k corresponding points are
found in W_{k-1}, i.e., for (x,y)=(0,0), (width-1,0),
(0,height-1) and (width-1,height-1) the (x",y")s are
found using the above equations.

2. Using the four sets of (x,y) and corresponding (x",y"),
the linear equations for the eight unknown elements of
matrix P_k defined by:

x" = f_k(x,y) = {P_k(0,0)*x + P_k(0,1)*y + P_k(0,2)} / {P_k(2,0)*x + P_k(2,1)*y + P_k(2,2)}

y" = g_k(x,y) = {P_k(1,0)*x + P_k(1,1)*y + P_k(1,2)} / {P_k(2,0)*x + P_k(2,1)*y + P_k(2,2)}

are solved with the understanding that P_k(2,2)=1.

The matrix P_k obtained serves as an initial approximation,
which is refined using the following procedure. If SFM is
not performed, Q_{k-1} is not available and P_{k-1} serves as the
approximate P_k to be refined.
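
Solving for the eight unknown matrix elements from four point correspondences may be sketched as below. The function names and the use of Gaussian elimination with partial pivoting are choices made for this illustration; the description only states that the linear equations are solved with the last element fixed at 1.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def ppp_from_four_points(src, dst):
    """src, dst: four (x, y) pairs each. Returns a 3x3 matrix with P[2][2] == 1."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u * (p20*x + p21*y + 1) = p00*x + p01*y + p02, rearranged to be linear
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v * (p20*x + p21*y + 1) = p10*x + p11*y + p12
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    p = solve_linear(A, b)
    return [p[0:3], p[3:6], [p[6], p[7], 1.0]]
```

Each correspondence contributes two linear equations, so the four image corners give exactly the eight equations needed.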

The approximate P_k from above is refined using the
intensity and alpha components of the input image and the
presently existing 2-D planar image. The refinement
attempts to minimize a mean-squared-error over all possible
P_k and ρ_k. The refinement process is shown in FIG. 9. The
full resolution input image is considered to be the level 0
representation in a dyadic pyramid. Separable filters are used
to generate the dyadic pyramid. Other types of filtering
schemes could equally well be used for this purpose. Given
an image at level L, the image at level L+1 is obtained using
the system shown in FIG. 10. Each lowpass filter in this
figure computes outputs only at half the input sample
positions. For example the lowpass filter may have a length
of 5 and filter coefficients of (13/256, 64/256, 102/256,
64/256, 13/256). This structure is repeatedly used to generate
an (L_max+1)-level pyramid consisting of images at levels
0, 1, . . . , L_max. Note that this L_max may be different from that
used in the hierarchical matching described above. For
example in one implementation L_max=2. For the hierarchical
refinement of P_k, pyramids of I_k, A_k, WI_{k-1} and WA_{k-1}
are generated.
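
One possible realization of the pyramid generation with the example 5-tap filter is sketched below. The replicate-border handling, the function names and the list-of-rows image layout are assumptions of this sketch; the description specifies only the separable lowpass filtering with outputs computed at half of the input sample positions.

```python
TAPS = (13 / 256, 64 / 256, 102 / 256, 64 / 256, 13 / 256)

def lowpass_decimate(samples):
    """Filter a 1-D sequence with TAPS, computing outputs at even positions only."""
    n = len(samples)
    out = []
    for center in range(0, n, 2):
        acc = 0.0
        for offset, tap in zip(range(-2, 3), TAPS):
            idx = min(max(center + offset, 0), n - 1)  # replicate border (assumption)
            acc += tap * samples[idx]
        out.append(acc)
    return out

def reduce_level(image):
    """One pyramid step: filter and decimate the rows, then the columns."""
    rows = [lowpass_decimate(row) for row in image]
    cols = [lowpass_decimate(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def build_pyramid(image, max_level=2):
    """Return the images at levels 0 (full resolution) through max_level."""
    levels = [image]
    for _ in range(max_level):
        levels.append(reduce_level(levels[-1]))
    return levels
```

Because the filter coefficients sum to one, a constant image stays constant at every level while its dimensions halve.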

Given P_k for images at level L=L_1, P_k may be scaled so
that it may be applied on images at level L=L_2 through:

P_k(2,0) = 2^(L_2-L_1) * P_k(2,0)
P_k(2,1) = 2^(L_2-L_1) * P_k(2,1)

Other elements of P_k and ρ_k remain unchanged in this
scaling process.

To refine P_k at a given level of the pyramid:

1. Using the current values of P_k and ρ_k, compute I'_k and
A'_k as above, and then the mean squared error, MSE(P_k,ρ_k).

2. Estimate a value ρ'_k from the valid pixel locations (x,y).

3. Using the current P_k and ρ'_k compute I'_k and A'_k and the
resulting mean squared error, MSE(P_k,ρ'_k).

4. Decide between ρ_k and ρ'_k as follows. If MSE(P_k,ρ'_k)
<MSE(P_k,ρ_k), then ρ_k=ρ'_k and bestMSE=MSE(P_k,ρ'_k);
otherwise retain the value of ρ_k and bestMSE=MSE(P_k,ρ_k).

5. Set a counter ctr=0, bestP_k=P_k, TmpP_k=P_k.

6. Initialize all elements of an 8x8 matrix A and an 8x1
vector B to zero, set a counter overlapCnt=0 and set an
accumulator error=0.0.

7. For each valid position (x,y) in image I_k do:
  Compute (x',y') using current values of TmpP_k and ρ_k,
  x'=f_k(x,y) and y'=g_k(x,y).
  In general (x',y') are not integers but real numbers.
  If (x',y') falls within the boundaries of WI_{k-1} and if WA_{k-1}
  at the four pixels supporting (x',y') indicates validity, do:
    overlapCnt=overlapCnt+1
    Through bilinear interpolation compute I'_k(x,y).
    Set error=error+{I_k(x,y)-I'_k(x,y)}^2
    Compute partial differentials

    dedp(0) = ∂error/∂TmpP_k(0,0) = (x*ρ_k/Den)*(∂WI_{k-1}(x',y')/∂x')
    dedp(1) = ∂error/∂TmpP_k(0,1) = (y*ρ_k/Den)*(∂WI_{k-1}(x',y')/∂x')
    dedp(2) = ∂error/∂TmpP_k(0,2) = (ρ_k/Den)*(∂WI_{k-1}(x',y')/∂x')
    dedp(3) = ∂error/∂TmpP_k(1,0) = (x*ρ_k/Den)*(∂WI_{k-1}(x',y')/∂y')
    dedp(4) = ∂error/∂TmpP_k(1,1) = (y*ρ_k/Den)*(∂WI_{k-1}(x',y')/∂y')
    dedp(5) = ∂error/∂TmpP_k(1,2) = (ρ_k/Den)*(∂WI_{k-1}(x',y')/∂y')
    dedp(6) = ∂error/∂TmpP_k(2,0) = (-x*ρ_k/Den)*{x'*(∂WI_{k-1}(x',y')/∂x') + y'*(∂WI_{k-1}(x',y')/∂y')}
    dedp(7) = ∂error/∂TmpP_k(2,1) = (-y*ρ_k/Den)*{x'*(∂WI_{k-1}(x',y')/∂x') + y'*(∂WI_{k-1}(x',y')/∂y')}

    where the denominator "Den" is given by:

    Den = TmpP_k(2,0)*x + TmpP_k(2,1)*y + TmpP_k(2,2)

    and the partial differentials are computed taking
    into account the non-integer values of x' and y'.
    Next perform the two loops:
    For k=0 to 7, for l=0 to 7, do:
      A(k,l)=A(k,l)+dedp(k)*dedp(l)
    done.
    For k=0 to 7, do:
      B(k)=B(k)-error*dedp(k)
    done,
  done,
done.

8. If (error/overlapCnt)<bestMSE,
  bestMSE=error/overlapCnt
  bestP_k=TmpP_k
Otherwise do not change bestMSE and bestP_k.

9. Update TmpP_k:

TmpP_k = TmpP_k - T_s*(A^T A)^-1 A^T B

where T_s is a selectable stepsize for updating TmpP_k,
such as 0.5.

10. If the changes to TmpP_k are significant, i.e.,
max{(A^T A)^-1 A^T B}>T_m, and if ctr<T_c, continue the iterations
by going to step 6. Otherwise the iterations may be
terminated and go to the next step. Representative
values for T_m and T_c are 1.0×10^-10 and 10.

11. If the fraction of pixels predicted,
(overlapCnt)/(Number of samples in I_k with A_k≥T_α),
is ≥T_v, set P_k=bestP_k. Otherwise P_k could not be changed
based on the iterations just performed. Here T_v is a
threshold which equalled 0.5 in one implementation.
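
The bilinear interpolation with the validity check on the four supporting pixels, as used inside the per-pixel loop of the refinement, may be sketched as below. The function name, the list-of-rows layout and the default alpha threshold of 128 are assumptions; returning None for an unusable sample is a convention of this sketch only.

```python
import math

def bilinear_sample(image, alpha, x, y, t_alpha=128):
    """Return the interpolated value at real-valued (x, y), or None if unusable.

    image, alpha: lists of rows (indexed [row][col], i.e. [y][x]).
    The sample is rejected when any of the four supporting pixels lies
    outside the image or has alpha below t_alpha (threshold is an assumption).
    """
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = x0 + 1, y0 + 1
    height, width = len(image), len(image[0])
    if x0 < 0 or y0 < 0 or x1 >= width or y1 >= height:
        return None
    if any(alpha[r][c] < t_alpha for r in (y0, y1) for c in (x0, x1)):
        return None
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom
```

Only samples for which this function returns a value would contribute to the overlap count and to the accumulated error.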

The quantity "error" above is not guaranteed to be
monotonically decreasing with iterations. For this reason the
matrices TmpP_k and bestP_k and the quantity bestMSE are used.
These quantities enable the capture of the value of P_k that
results in the minimum mean-squared-error while performing
the iterative procedure above.
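
The bookkeeping that retains the best parameters when the error does not decrease monotonically may be sketched generically as below. The function and argument names are hypothetical; `step` stands in for one update of the working parameters and `cost` for the mean-squared-error evaluation.

```python
def iterate_keep_best(initial, step, cost, max_iters=10):
    """Run `step` repeatedly and return the (params, cost) with the lowest cost seen."""
    best_params, best_cost = initial, cost(initial)
    current = initial
    for _ in range(max_iters):
        current = step(current)          # working copy may get worse
        current_cost = cost(current)
        if current_cost < best_cost:     # replace only on strict improvement
            best_params, best_cost = current, current_cost
    return best_params, best_cost
```

The working value continues to be updated even after an iteration that increases the cost, while the best value is kept separately, mirroring the roles of the temporary and best matrices in the procedure.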

In the pasting block 38 the images I_k, U_k, V_k and A_k are
combined with the corresponding components of W_{k-1} using P_k
and ρ_k to generate W_k:

1. Compute the corresponding positions (x'_i,y'_i) in W_{k-1} of
the four corner pixels (x_i,y_i) of I_k, as illustrated in FIG.
11, using P_k, for:

(x_0,y_0)=(0,0)
(x_1,y_1)=(width-1,0)
(x_2,y_2)=(width-1,height-1)
(x_3,y_3)=(0,height-1)

where width and height refer to the width and height of
I_k. Values of (x'_i,y'_i) are forwarded to the refresh decider
block 40.

2. Compute max_i{x'_i}, max_i{y'_i}, min_i{x'_i} and min_i{y'_i}.

3. If min_i{x'_i}<0, leftgrow=-min_i{x'_i}, otherwise
leftgrow=0.

4. If min_i{y'_i}<0, topgrow=-min_i{y'_i}, otherwise
topgrow=0.

5. If max_i{x'_i}≥WidthofW_{k-1}, rightgrow=max_i{x'_i}+1-
WidthofW_{k-1}, otherwise rightgrow=0.

6. If max_i{y'_i}≥HeightofW_{k-1}, bottomgrow=max_i{y'_i}+
1-HeightofW_{k-1}, otherwise bottomgrow=0.

7. Pad W_{k-1} and HITS in the left with leftgrow columns
of pixels, in the right with rightgrow columns of pixels,
in the top with topgrow rows of pixels, and in the
bottom with bottomgrow rows of pixels. This involves
padding all the four components of W_{k-1} and HITS
with the above amounts. As padding material black
pixels are used, i.e., WI_{k-1}, WA_{k-1} and HITS are
padded with 0 while WU_{k-1} and WV_{k-1} are padded
with 128 if the source material is 8 bits per component.

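8. Update P_k due to the change in dimensions of W_{k-1}, so
that f_k and g_k address the padded image, i.e., x' is shifted
by leftgrow and y' by topgrow:

P_k(0,0)=P_k(0,0)+leftgrow*P_k(2,0)
P_k(0,1)=P_k(0,1)+leftgrow*P_k(2,1)
P_k(0,2)=P_k(0,2)+leftgrow*P_k(2,2)

P_k(1,0)=P_k(1,0)+topgrow*P_k(2,0)
P_k(1,1)=P_k(1,1)+topgrow*P_k(2,1)
P_k(1,2)=P_k(1,2)+topgrow*P_k(2,2)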

9. Project {I_k,U_k,V_k,A_k} to make them look like W_{k-1} by
computing the inverse P_{k,inv} of P_k:

P_{k,inv}(0,0)=(P_k(1,1)*P_k(2,2)-P_k(2,1)*P_k(1,2))/(P_k(0,0)*P_k(1,1)-P_k(1,0)*P_k(0,1))

P_{k,inv}(0,1)=(P_k(0,2)*P_k(2,1)-P_k(2,2)*P_k(0,1))/(P_k(0,0)*P_k(1,1)-P_k(1,0)*P_k(0,1))

P_{k,inv}(0,2)=(P_k(1,2)*P_k(0,1)-P_k(0,2)*P_k(1,1))/(P_k(0,0)*P_k(1,1)-P_k(1,0)*P_k(0,1))

P_{k,inv}(1,0)=(P_k(1,2)*P_k(2,0)-P_k(2,2)*P_k(1,0))/(P_k(0,0)*P_k(1,1)-P_k(1,0)*P_k(0,1))

P_{k,inv}(1,1)=(P_k(2,2)*P_k(0,0)-P_k(0,2)*P_k(2,0))/(P_k(0,0)*P_k(1,1)-P_k(1,0)*P_k(0,1))

P_{k,inv}(1,2)=(P_k(0,2)*P_k(1,0)-P_k(1,2)*P_k(0,0))/(P_k(0,0)*P_k(1,1)-P_k(1,0)*P_k(0,1))

P_{k,inv}(2,0)=(P_k(2,1)*P_k(1,0)-P_k(1,1)*P_k(2,0))/(P_k(0,0)*P_k(1,1)-P_k(1,0)*P_k(0,1))

P_{k,inv}(2,1)=(P_k(0,1)*P_k(2,0)-P_k(0,0)*P_k(2,1))/(P_k(0,0)*P_k(1,1)-P_k(1,0)*P_k(0,1))

P_{k,inv}(2,2)=1

Next I_k,U_k,V_k,A_k are projected to form WI'_k, WU'_k,
WV'_k, WA'_k that appear like WI_{k-1}, WU_{k-1}, WV_{k-1},
WA_{k-1} respectively, by sampling the input components at
(f_{k,inv}(x,y), g_{k,inv}(x,y)), where f_{k,inv} and g_{k,inv}
are defined through:

f_{k,inv}(x,y)=(P_{k,inv}(0,0)*x+P_{k,inv}(0,1)*y+P_{k,inv}(0,2))/(P_{k,inv}(2,0)*x+P_{k,inv}(2,1)*y+P_{k,inv}(2,2))

g_{k,inv}(x,y)=(P_{k,inv}(1,0)*x+P_{k,inv}(1,1)*y+P_{k,inv}(1,2))/(P_{k,inv}(2,0)*x+P_{k,inv}(2,1)*y+P_{k,inv}(2,2))

Bilinear interpolation is used to obtain values of
I_k,U_k,V_k,A_k at non-integer values of f_{k,inv}(x,y) and
g_{k,inv}(x,y).

10. Combine WI'_k, WU'_k, WV'_k, WA'_k with WI_{k-1}, WU_{k-1},
WV_{k-1}, WA_{k-1} through weighted summation. Specifically,
if WA'_k(x,y)≥T_α and WA_{k-1}(x,y)≥T_α,

WI_k(x,y)=(1.0-γ_i)*WI_{k-1}(x,y)+γ_i*WI'_k(x,y)
WU_k(x,y)=(1.0-γ_u)*WU_{k-1}(x,y)+γ_u*WU'_k(x,y)
WV_k(x,y)=(1.0-γ_v)*WV_{k-1}(x,y)+γ_v*WV'_k(x,y)
WA_k(x,y)=(1.0-γ_a)*WA_{k-1}(x,y)+γ_a*WA'_k(x,y)

HITS(x,y)=HITS(x,y)+1

where γ_i, γ_u, γ_v and γ_a are selectable scalars in the range
0 to 1. Otherwise if WA'_k(x,y)≥T_α,

WI_k(x,y)=WI'_k(x,y)

WU_k(x,y)=WU'_k(x,y)

WV_k(x,y)=WV'_k(x,y)

WA_k(x,y)=WA'_k(x,y)

Otherwise

WI_k(x,y)=WI_{k-1}(x,y)
WU_k(x,y)=WU_{k-1}(x,y)
WV_k(x,y)=WV_{k-1}(x,y)
WA_k(x,y)=WA_{k-1}(x,y)

Some special cases of the scalars are:

(a) γ=0. This means copying or propagating old pixel
values available in W_{k-1} to W_k as much as
possible, implying that older values in the planar
image are given importance and are retained.

(b) γ=0.5. This gives equal weights to the accumulated
information from old input images and the
information in the new input image.

(c) γ=1.0/√2. This means the contribution coming
from any particular image in the planar image falls
off exponentially as the process goes through the
sequence of input images. This particular value
provides the largest value with convergence property
to the temporal integration, i.e., summation,
happening at each position of the planar image.

(d) γ=1. This means destroying old pixel values in
W_{k-1} as much as possible and copying the new
information available in I_k into the planar image.

(e) γ=1/HITS(x,y). This value gives equal weight to
each image that contributes to location (x,y) in the
planar image.
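
The per-pixel combination of the existing planar image with the projected input image may be sketched as below for a single component. The function name and argument layout are hypothetical, and the use of the HITS count after it has been incremented when the weight is taken as the reciprocal of HITS is an assumption of this sketch about the order of operations.

```python
def blend_pixel(old_value, new_value, old_valid, new_valid, hits, gamma=None):
    """Combine one pixel of the planar image with the projected input pixel.

    Returns (value, hits). gamma=None selects the per-pixel weight 1/HITS
    (equal weight per contributing image); HITS is used after the increment,
    which is an assumption about ordering.
    """
    if new_valid and old_valid:
        hits += 1
        g = 1.0 / hits if gamma is None else gamma
        return (1.0 - g) * old_value + g * new_value, hits
    if new_valid:
        return new_value, hits   # only the new image covers this pixel
    return old_value, hits       # keep what the planar image already had
```

Starting from a HITS count of 1 for the first image, successive calls with the reciprocal weight produce the running mean of all contributing values, which is the equal-weight behavior described in case (e).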
The decision to terminate the evolution of the current 2-D
planar image and start a new planar image is made by the
refresh decider block 40, and is based on several criteria:

1. End of Sequence. The process is terminated if all the
input images in the input video have been processed.

2. Scene Change. If a scene change is identified by the
scene cut detector 12, a new planar image is created for
the subsequent scene shot.

3. Memory Constraints. A decision is made to refresh the
planar image if any of the following are true:

(a) WidthofW_{k-1}≥T_width,

(b) HeightofW_{k-1}≥T_height,

(c) AreaofW_{k-1}≥T_area,

where the thresholds are selectable; for example, the width
and height thresholds may each be 2048, with the area
threshold a power of two.

4. Excessive camera zoom-in or zoom-out. If there is a
large zoom-out happening in the video with respect to
the first image, the planar image size grows big, even
after preprocessing only a few input images, making
the planar image inefficient for compression purposes.
Also the resulting planar image is blurred. On the other
hand if there is a large zoom-in happening with respect
to the first image, the updates to the planar image are
small from picture to picture. The process of pasting
onto the planar image is quite lossy for the details
(higher spatial frequency information) present in the
input image. Therefore when a large zoom-in or zoom-out
is detected with respect to the first input image, a
decision is made to refresh the planar image. Based on
the values of (x'_i,y'_i) computed above in the refinement
process, the detection of the presence of large zoom-in
or zoom-out is made as follows:

(i) denoting by width and height the widths and heights
of I_k, a large zoom-in is determined if any of the
following are true:

(a) min{SQRT((x'_0-x'_3)^2+(y'_0-y'_3)^2), SQRT((x'_1-x'_2)^2+(y'_1-y'_2)^2)}≤(T_minh*height)

(b) min{SQRT((x'_0-x'_1)^2+(y'_0-y'_1)^2), SQRT((x'_3-x'_2)^2+(y'_3-y'_2)^2)}≤(T_minw*width)

(c) area of the polygon {(x'_0,y'_0),(x'_1,y'_1),(x'_2,y'_2),
(x'_3,y'_3)}≤(T_mina*area)

where the thresholds are selectable, e.g., all equal
to 0.5 in one implementation. The height,
width and area (height*width) refer to I_k.

Similarly a large zoom-out is identified for any of
the following:

(a) max{SQRT((x'_0-x'_3)^2+(y'_0-y'_3)^2), SQRT((x'_1-x'_2)^2+(y'_1-y'_2)^2)}≥(T_maxh*height)

(b) max{SQRT((x'_0-x'_1)^2+(y'_0-y'_1)^2), SQRT((x'_3-x'_2)^2+(y'_3-y'_2)^2)}≥(T_maxw*width)

(c) area of the polygon {(x'_0,y'_0),(x'_1,y'_1),(x'_2,y'_2),
(x'_3,y'_3)}≥(T_maxa*area)

where the thresholds are selectable, such as all
being 2. Finally excessive zoom-in or zoom-out
may also be detected by accumulating the
components of the 3-D translation vector T_{k-1} which
indicate 3-D zoom-in or zoom-out, depending
on their signs. This is only available when the
SFM algorithm is being used.

5. Excessive camera rotation. If there is excessive rotation
in 3-D by the camera with respect to the first image,
there is excessive skew in the polygon
{(x'_0,y'_0),(x'_1,y'_1),(x'_2,y'_2),(x'_3,y'_3)}. This
implies that the planar image is missing some details
present in the input images. The excessive rotation of
the camera may be detected either based on cumulated
R_{k-1}'s or based on the angles of the polygon. The
rotation parameters are only available when the SFM
algorithm is being used. If the minimum of the four angles
at the vertices of the polygon is smaller than a given
threshold, an excessive skew is detected, and a decision
is made to refresh the planar image. This threshold
may be 30°.
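
The zoom and skew tests on the projected corner polygon may be sketched as below. The function names and the single shared threshold per test (0.5 for zoom-in, 2 for zoom-out, 30 degrees for skew) follow the example values in the description, while the shoelace area formula and the assumption of a convex, non-degenerate quadrilateral are choices of this illustration.

```python
import math

def side(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def polygon_area(c):
    """Shoelace area of the quadrilateral c[0..3]."""
    s = sum(c[i][0] * c[(i + 1) % 4][1] - c[(i + 1) % 4][0] * c[i][1] for i in range(4))
    return abs(s) / 2.0

def min_vertex_angle(c):
    """Smallest interior angle in degrees (assumes a convex, non-degenerate quad)."""
    angles = []
    for i in range(4):
        ax, ay = c[i - 1][0] - c[i][0], c[i - 1][1] - c[i][1]
        bx, by = c[(i + 1) % 4][0] - c[i][0], c[(i + 1) % 4][1] - c[i][1]
        cos_t = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_t)))))
    return min(angles)

def refresh_reason(c, width, height, t_min=0.5, t_max=2.0, t_angle=30.0):
    """Return 'zoom-in', 'zoom-out', 'skew' or None for projected corners c.

    c is ordered as (0,0), (width-1,0), (width-1,height-1), (0,height-1).
    """
    vertical = (side(c[0], c[3]), side(c[1], c[2]))
    horizontal = (side(c[0], c[1]), side(c[3], c[2]))
    area = polygon_area(c)
    if (min(vertical) <= t_min * height or min(horizontal) <= t_min * width
            or area <= t_min * width * height):
        return "zoom-in"
    if (max(vertical) >= t_max * height or max(horizontal) >= t_max * width
            or area >= t_max * width * height):
        return "zoom-out"
    if min_vertex_angle(c) < t_angle:
        return "skew"
    return None
```

A caller would evaluate this once per input image on the four projected corners and start a new planar image whenever a reason other than None is returned.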

Thus the present invention provides for generation of 2-D
extended images from a video sequence by processing
pre-segmented objects from images in the video sequence
using a structure-from-motion algorithm and, from resulting
depth maps and camera motion parameters for each image,
generating the 2-D extended image for flat or background
objects.

What is claimed is:

1. A method of generating a 2-D extended image from a
video sequence representing a natural 3-D scene comprising
the steps of:

for a background object segmented from the video 
sequence, determining motion parameters for a camera 
that recorded the video sequence with respect to a 3-D 
coordinate system, and from the motion parameters 
determining a depth map representing the depth of each 
point in the background object from the camera; and 

from the motion parameters and depth map, generating 
the 2-D extended image for the background object as a 
composition of contiguous images from the video 
sequence, wherein the determining step comprises the 
steps of: 

extracting feature points for the background object 
from an image of the video sequence; 

tracking the feature points for the background object
into a next image of the video sequence to produce
feature point correspondence pairs;

performing a structure-from-motion algorithm on the
feature point correspondence pairs to produce a
rotation matrix and a translation vector as the motion
parameters as well as a depth value for each feature
point in the image; and

from the depth values generating the depth map. 

2. The method as recited in claim 1 further comprising the 
step of pre-segmenting the background object from the 
video sequence prior to the determining step. 

3. The method as recited in claim 2 further comprising the
step of detecting scene cuts in the video sequence to divide
the video sequence into a plurality of scene shots, each scene
shot being processed separately by the pre-segmenting,
determining and generating steps.

4. The method as recited in claim 1 wherein the tracking 
step further comprises the step of removing outliers from the 
feature point correspondence pairs prior to the performing 
step. 

5. The method as recited in claim 1 wherein the performing
step comprises the steps of:

pre-processing the feature point correspondence pairs to 
produce a plurality of sets of estimated motion param- 
eters; and 

selecting from among the sets of estimated motion param- 
eters a best set as the motion parameters. 

6. The method as recited in claim 5 wherein the selecting 
step comprises the step of computing an overall velocity 
estimation error from input and output velocity vectors for 
every feature point, with the combination that minimizes the 
overall velocity estimation error being selected as the best 
set. 

7. The method as recited in claim 5 wherein the selecting
step comprises the step of statistically finding the best set
from among the sets of estimated motion parameters.
8. The method as recited in claim 1 further comprising the
step of deciding from the motion parameters and depth map
whether to proceed with the generating step based upon a
flatness criterion for the object.

9. The method as recited in claim 8 wherein the deciding
step further comprises the step of providing a segmentation
signal for use in further segmenting the background object.

10. A method of generating a 2-D extended image from a
video sequence representing a natural 3-D scene comprising
the steps of:

for a background object segmented from the video
sequence, determining motion parameters for a camera
that recorded the video sequence with respect to a 3-D
coordinate system, and from the motion parameters
determining a depth map representing the depth of each
point in the background object from the camera; and

from the motion parameters and depth map, generating
the 2-D extended image for the background object as a
composition of contiguous images from the video
sequence, wherein the generating step comprises the
steps of:

predicting a next image of the background object as a
coarse predicted image from a current version of the
2-D extended image;

refining the coarse predicted image to generate a
predicted image;

pasting the predicted image to the current version of the
2-D extended image to generate the 2-D extended
image; and

repeating the predicting, refining and pasting steps until
an end condition is achieved.

11. The method as recited in claim 10 wherein the
predicting step comprises the step of integrating the current
version of the 2-D extended image with the motion
parameters to produce the coarse predicted image.

12. [...] comprising [...] 2-D extended image relative to a first
[...].

13. The method as recited in claim 10 further comprising
deciding from the output of a scene-cut detector to which the
video sequence is input whether a scene-cut has occurred as
the end condition.

14. The method as recited in claim 10 further comprising
the step of pre-segmenting the background object from the
video sequence prior to the determining step.

15. The method as recited in claim 14 further comprising
the step of detecting scene cuts in the video sequence to
divide the video sequence into a plurality of scene shots,
each scene shot being processed separately by the
pre-segmenting, determining and generating steps.

16. The method as recited in claim 10 further comprising
the step of deciding from the motion parameters and depth
map whether to proceed with the generating step based upon
a flatness criterion for the object.

17. The method as recited in claim 16 wherein the
deciding step further comprises the step of providing a
segmentation signal for use in further segmenting the
background object.

* * * * *